Hierarchical multimodal transformer to summarize videos
Abstract
Although video summarization has achieved tremendous success by benefiting from Recurrent Neural Networks (RNNs), RNN-based methods neglect the global dependencies and multi-hop relationships among video frames, which limits their performance. The transformer is an effective model for dealing with this problem, and it surpasses RNN-based methods in several sequence modeling tasks, such as machine translation and video captioning. Motivated by the great success of the transformer and by the natural structure of video (frame-shot-video), a hierarchical transformer is developed for video summarization. It can capture the dependencies among frames and shots, and it summarizes the video by exploiting the scene information formed by shots. Furthermore, we argue that both audio and visual information are essential for the video summarization task. To integrate the two kinds of information, they are encoded in a two-stream scheme, and a multimodal fusion mechanism is developed based on the hierarchical transformer. In this paper, the proposed method is denoted as Hierarchical Multimodal Transformer (HMT). Practically, extensive experiments show that HMT achieves (F-measure: 0.441, Kendall’s τ: 0.079, Spearman’s ρ: 0.080) and (F-measure: 0.601, Kendall’s τ: 0.096, Spearman’s ρ: 0.107) on SumMe and TVSum, respectively. It surpasses most of the traditional, RNN-based, and attention-based video summarization methods.
Similar resources
Learning to score and summarize figure skating sport videos
This paper focuses on fully understanding figure skating sport videos. In particular, we present a large-scale figure skating sport video dataset, which includes 500 figure skating videos. On average, the length of each video is 2 minutes and 50 seconds. Each video is annotated by three scores from nine different referees, i.e., Total Element Score (TES), Total Program Component Score (PCS), a...
For Your Eyes Only: Learning to Summarize First-Person Videos
With the increasing amount of video data, it is desirable to highlight or summarize the videos of interest for viewing, search, or storage purposes. However, existing summarization approaches are typically trained from third-person videos, which cannot generalize to highlight the first-person ones. By advancing deep learning techniques, we propose a unique network architecture for transferring ...
Hierarchical Spatial Transformer Network
Computer vision researchers have long expected neural networks to have a spatial transformation ability that eliminates the interference caused by geometric distortion. The emergence of the spatial transformer network made this dream come true. The spatial transformer network and its variants can handle global displacement well, but lack the ability to deal with local spatial variance. Hence how ...
A Hierarchical Approach to Multimodal Classification
Data models that are induced in classifier construction often consist of multiple parts, each of which explains part of the data. Classification methods for such models are called multimodal classification methods. The model parts may overlap or have insufficient coverage. How can the problems of overlapping and insufficient coverage best be handled? In this paper we propose hierarchical or ...
Multimodal Location Estimation of Videos and Images
Journal
Journal title: Neurocomputing
Year: 2022
ISSN: 0925-2312, 1872-8286
DOI: https://doi.org/10.1016/j.neucom.2021.10.039